Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more of the computing cores on which they execute fail. This imposes not only a maintenance cost on the job, but also the cost of the time taken to reinstate it and the risk of losing the data and execution the job accomplished before it failed. Approaches that can proactively detect computing core failures and relocate the affected core's job onto reliable cores are a significant step towards automating fault tolerance.

Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single-core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology at both the job and core levels. Experiments are pursued in the context of genome searching, a popular computational biology application.

Result: The key conclusion is that the proposed approaches are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which fault tolerance is studied, centralised and decentralised checkpointing approaches add, on average, 90% to the actual time for executing the job. In the same experiment, the multi-agent approaches add only 10% to the overall execution time.
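To make the core-level idea concrete, the following is a minimal sketch, not the paper's implementation: an agent bound to each core checks core health before every step and, on a predicted failure, hands the job (together with its accumulated progress) to a reliable peer rather than letting the step fail. All names here (CoreAgent, Job, run, fail_at) are illustrative assumptions.

```python
class Job:
    """A unit of work whose progress survives relocation between cores."""
    def __init__(self, name, total_steps):
        self.name = name
        self.total_steps = total_steps
        self.completed_steps = 0  # progress carried along on relocation


class CoreAgent:
    """Agent bound to one computing core; monitors the core's health."""
    def __init__(self, core_id, fail_at=None):
        self.core_id = core_id
        self.fail_at = fail_at  # step at which this core is predicted to fail
        self.job = None

    def healthy(self, step):
        return self.fail_at is None or step < self.fail_at


def run(job, agents):
    """Execute the job step by step, relocating it away from failing cores."""
    current = agents[0]
    current.job = job
    for step in range(job.total_steps):
        # Proactive check: before executing the step, the agent inspects
        # core health and, on a predicted failure, migrates the job and its
        # progress to a reliable peer instead of losing completed work.
        if not current.healthy(step):
            spare = next(a for a in agents if a is not current and a.healthy(step))
            print(f"step {step}: core {current.core_id} failing -> "
                  f"relocating {job.name} to core {spare.core_id}")
            current.job, spare.job, current = None, job, spare
        job.completed_steps += 1
    print(f"{job.name} finished on core {current.core_id} "
          f"({job.completed_steps}/{job.total_steps} steps)")


if __name__ == "__main__":
    # Hypothetical scenario: core 0 is predicted to fail at step 4,
    # so the job migrates to core 1 and still completes all steps.
    run(Job("genome-search", 10), [CoreAgent(0, fail_at=4), CoreAgent(1)])
```

Because the agent carries the job's state along at the moment of migration, no separate checkpoint needs to be written and restored, which is one plausible reading of why the multi-agent approaches incur far less overhead than the checkpointing baselines reported above.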